Table of Contents


Authors

  • Jerry Chao (JYC2171)
  • Sal El-Sadek (SNE2114)
  • Kyung Suk Lee (KL3069)
  • Lusha Liang (LL3344)

Introduction

Welcome to our Final Project for P8105, leveraging the skills we have developed over the semester to analyze a dataset of pediatric patients with COVID-19 infection and the predictors associated with hospitalization. In this project, we present statistical analyses using logistic regression and exploratory visualizations of the dataset, incorporating Plotly, Dashboard, and Shiny.

Project Motivation

While the clinical characteristics and hospital courses of patients with COVID-19 have been previously described, particularly in adults, the risk factors associated with hospitalization in pediatric patients has been understudied 1. Fortunately, COVID-19 seems to affect children generally less severely than adults 2. However, some children do become ill enough to require hospitalization. Moreover, as the number of cases continues to rise across the country, COVID-19 cases in pediatric patients also continues to increase 3.

Here, we present our analyses of a dataset of 375 pediatric patients who returned a positive SARS-CoV-2 RT PCR test at a single tertiary care medical center in New York City from the beginning of the COVID-19 pandemic until October 1, 2020.

The Dataset

The dataset contains de-identified patient health information consisting of age (from 0 to up to 23 years), demographic data (including biological sex, zip code of residence, and a socioeconomic status variable), whether there was a preceding visit to the Emergency Department, any past medical history of asthma, obesity, and/or diabetes, and whether there was an admission to the intensive care unit.

The goal of this project was to perform an exploratory analysis of this dataset in order to begin to assess the risk factors associated with pediatric hospitalization for COVID-19 infection. Prior studies have shown that clinical features including low oxygen saturation and abnormal chest imaging are predictive of the need for hospitalization. Our focus in this present study was more on demographic characteristics and pre-existing comorbidities. We hypothesize that increasing age, a past medical history of asthma, obesity, and diabetes are positively associated with hospitalization.

Prior Studies

Prior studies in children have focused on the clinical characteristics after hospitalization, as well as hospital course 4, 5. In particular, some children have required admission to the intensive care unit and some have developed a multi-system inflammatory syndrome 6. Fortunately, most children in the community seem to fare well with few developing critical illness.

Tidying The Data

First we load our necessary packages and set ggplot preferences.

library(tidyverse)
library(usa)
library(mice)

knitr::opts_chunk$set(
  fig.width = 6,
  fig.asp = .6,
  out.width = "90%")

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis")

scale_colour_discrete = scale_color_viridis_d
scale_fill_discrete = scale_fill_viridis_d

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, echo = TRUE)

Now, we will load the data and take a look.

# Load the data
ped_covid =
  read_csv("./data/p8105_final_ped_covid.csv")

This is a de-identified dataset of pediatric patients from a tertiary care medical center in New York City who tested positive for COVID-19 via the SARS-CoV-2 RT PCR test. The age range in the study is 0 to 22 years of age. An id number for each patient has been randomly generated. There are 375 rows (patients) and 30 columns in this dataset. The variables included in this dataset are id, admitted, age, date_of_birth, ethnicity, gender, censusblock, censusblockgroup, censustract, city, race, ses, state, zip_code_set, eventdatetime, outcomeadmission_admission_1inpatient_admit_service, bmi_yes_or_no, bmi_event_date_time, bmi_value, asthma_date_time, asthma_dx, diabetes_date_time, diabetes_dx, icu_yes_no, icu_date_time, systolic_bp_event_date_time, systolic_bp_value, ed_yes_no_0_365_before, admission_primary_dx, admission_apr_drg. Some of the important variables are date and time of positive covid test (“eventdatetime”), whether the patient was admitted (“admitted”), whether there was a preceding emergency department visit (“ed_yes_no_0_365_before”), whether the patient needed intensive care admission (“icu_yes_no”) and date and time of icu admission (“icu_date_time”), demographic data (age, gender, ethnicity, race, zip code data - predominantly in the Bronx), some past medical history data (bmi data, asthma data, diabetes data) and one vital sign datum (systolic blood pressure).

The dataset is not tidy for a number of reasons including but not limited to:

  • Race seems to encompass Black, Asian, and White patients but whether patients are Hispanic or not is only coded under ethnicity. Although this is a complicated issue, for simplicity’s sake we combined ethnicity and race into a single variable.
  • Presence of a medical history of asthma and diabetes were indicated by presence of an International Classification of Disease Code (ICD). However, the same ICD code was not used for all patients. While there are certainly many subtleties in the severity of asthma and type/severity of diabetes, for ease of analysis we chose to simply code asthma and diabetes as binary variables.
  • We chose to add a variable called “obesity” which was a binary variable that was present when body mass index (BMI) was greater than or equal to 30 and absent when it was less than 30. Of note, in pediatric patients, true classification of obesity is dependent on the patient’s age and place on a growth chart.
ped_covid =
  read_csv("./data/p8105_final_ped_covid.csv") %>%
  mutate(
    ethnicity_race = case_when(
      race == "R3 Black or African-American"        ~ "black",
      race == "R2 Asian"                            ~ "asian",
      race == "R5 White"                            ~ "caucasian",
      race == "R1 American Indian or Alaska Native" ~ "american indian",
      race == "Multiple Selected"                   ~ "multiple",
      ethnicity == "E1 Spanish/Hispanic/Latino"     ~ "latino"
    )
  ) %>%
  mutate(
    asthma = replace_na(asthma_dx, 0),
    asthma = str_replace(asthma, ".*J.*", "1"),
    diabetes = replace_na(diabetes_dx, 0),
    diabetes = str_replace(diabetes, ".*E.*", "1"),
    zip = as.character(zip_code_set),
    service = outcomeadmission_admission_1inpatient_admit_service, 
    ed = ed_yes_no_0_365_before,
    admission_dx = admission_apr_drg,
    icu = icu_yes_no
  ) %>%
  mutate(obesity = case_when(
    bmi_value >= 30 ~ "1",
    bmi_value < 30 ~"0"
  ))

We noted that the dataset contained information about zipcode but not latitude and longitude. To add this information we merged zipcodes with a file in the usa package containing information about latitude and longitude for each zipcode.

# Merge zipcode data with latitude and longitude.
zipcode_df = 
  usa::zipcodes

ped_covid = 
  left_join(ped_covid, zipcode_df, by = "zip") %>%
  select(admitted, age, gender, ses, zip, eventdatetime, bmi_value, icu, icu_date_time, 
         systolic_bp_value, ethnicity_race, asthma, diabetes, zip, service, ed, admission_dx,
         city.y, obesity, lat, long) %>%
  mutate_at(c("admitted", "icu", "ethnicity_race", "asthma", "diabetes",
              "ed", "city.y", "obesity"), as.factor) %>%
  mutate(
    gender = factor(gender, levels = c("F", "M")),
    ethnicity_race = factor(ethnicity_race, levels = c("caucasian", "black", "latino", "asian", "american indian", "multiple", "NA"))
  ) %>%
  rename(city = city.y) 

An additional issue in the data is that it contains significant amounts of missing data. This is less of an issue for visualizations and logistic regression but will become an issue for model building with machine learning within the caret package. We thus used the mice package in R to impute missing data for later use in model building. This package uses sequential regression multiple imputation, which is advantageous over single imputations in that it accounts for the statistical uncertainty in the imputation process.

# Impute missing data for use in model_building
impute = mice(ped_covid, m=3, seed=111)
datacomplete = complete(impute,2)

# Export csv
write_csv(datacomplete, "./data/datacomplete.csv")

Exploratory Analyses

To perform exploratory analyses, we loaded some additional libraries.

library(patchwork)
library(corrplot)
library(mgcv)
library(modelr)

Our goal here is to develop a causation-type model instead of a prediction=targeted one. This is different than the interaction model tool, which was developed for predictive purposes. We will analyze correlations between admittance rate and multiple covariables and then develop a few select models to compare based on apriori knowledge and p-values. Another interesting variable that we would have liked to analyze was admittance to an ICU, but, as can be seen in the data summary, there were too few events of ICU admittance to draw any meaningful conclusions.

library(tidyverse)
library(patchwork)
library(corrplot)
library(mgcv)
library(modelr)

knitr::opts_chunk$set(
  fig.width = 6,
  fig.asp = 0.65,
  out.width = "100%",
  out.height = "75%")

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis")

scale_colour_discrete = scale_color_viridis_d
scale_fill_discrete = scale_fill_viridis_d

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, echo = TRUE)

Data

ped_covid =
  read_csv("./data/p8105_final_ped_covid.csv") %>%
  mutate(
    ethnicity_race = case_when(
      race == "R3 Black or African-American"        ~ "black",
      race == "R2 Asian"                            ~ "asian",
      race == "R5 White"                            ~ "caucasian",
      race == "R1 American Indian or Alaska Native" ~ "american indian",
      race == "Multiple Selected"                   ~ "multiple",
      ethnicity == "E1 Spanish/Hispanic/Latino"     ~ "latino"
    )
  ) %>%
  mutate(
    asthma = replace_na(asthma_dx, 0),
    asthma = str_replace(asthma, ".*J.*", "1"),
    diabetes = replace_na(diabetes_dx, 0),
    diabetes = str_replace(diabetes, ".*E.*", "1"),
    zip = as.character(zip_code_set),
    service = outcomeadmission_admission_1inpatient_admit_service, 
    ed = ed_yes_no_0_365_before,
    admission_dx = admission_apr_drg,
    icu = icu_yes_no
  ) %>%
  mutate(obesity = case_when(
    bmi_value >= 30 ~ "1",
    bmi_value < 30 ~"0"
  ))

zipcode_df = 
  usa::zipcodes

ped_covid = 
  left_join(ped_covid, zipcode_df, by = "zip") %>%
  select(admitted, age, gender, ses, zip, eventdatetime, bmi_value, icu, icu_date_time, 
         systolic_bp_value, ethnicity_race, asthma, diabetes, zip, service, ed, admission_dx,
         city.y, obesity, lat, long) %>%
  mutate_at(c("admitted", "icu", "ethnicity_race", "asthma", "diabetes",
              "ed", "city.y", "obesity"), as.factor) %>%
  mutate(
    gender = factor(gender, levels = c("F", "M")),
    ethnicity_race = factor(ethnicity_race, levels = c("caucasian", "black", "latino", "asian"))
  ) %>%
  rename(city = city.y)

summary(ped_covid)
 admitted       age         gender         ses              zip           
 no :250   Min.   : 0.00   F   :189   Min.   :-13.506   Length:375        
 yes:125   1st Qu.:10.00   M   :184   1st Qu.: -6.921   Class :character  
           Median :18.00   NA's:  2   Median : -4.122   Mode  :character  
           Mean   :14.63              Mean   : -4.308                     
           3rd Qu.:21.00              3rd Qu.: -2.112                     
           Max.   :22.00              Max.   :  2.931                     
                                      NA's   :141                         
 eventdatetime        bmi_value     icu     icu_date_time     
 Length:375         Min.   :12.03   0:355   Length:375        
 Class :character   1st Qu.:18.73   1: 20   Class :character  
 Mode  :character   Median :23.20           Mode  :character  
                    Mean   :25.53                             
                    3rd Qu.:30.73                             
                    Max.   :80.84                             
                    NA's   :116                               
 systolic_bp_value   ethnicity_race asthma  diabetes   service          ed     
 Min.   : 61       caucasian: 18    0:314   0:358    Length:375         0:171  
 1st Qu.:103       black    : 75    1: 61   1: 17    Class :character   1:204  
 Median :117       latino   :189                     Mode  :character          
 Mean   :116       asian    :  8                                               
 3rd Qu.:128       NA's     : 85                                               
 Max.   :182                                                                   
 NA's   :240                                                                   
 admission_dx                 city     obesity         lat       
 Length:375         Bronx       :317   0   :187   Min.   :40.58  
 Class :character   Yonkers     : 11   1   : 72   1st Qu.:40.83  
 Mode  :character   Mount Vernon:  7   NA's:116   Median :40.85  
                    Brooklyn    :  6              Mean   :40.86  
                    New York    :  5              3rd Qu.:40.87  
                    (Other)     : 24              Max.   :41.52  
                    NA's        :  5              NA's   :5      
      long       
 Min.   :-74.20  
 1st Qu.:-73.90  
 Median :-73.88  
 Mean   :-73.88  
 3rd Qu.:-73.86  
 Max.   :-73.67  
 NA's   :5       

Dataset exploration

This is a dataset of 375 pediatric patients 0 to 23 years of age with COVID-19 infection. First, we explore the data by generating various plots.

Age

There appears to be a bimodal distribution of hospital admission as a function of age. Among infants and toddlers less than 5 years of age who test positive for COVID-19, more are admitted than not admitted. Note that this could be due to babies being admitted along with their mothers’ who had COVID-19. After 16 years of age, hospitalizations for COVID-19 infection appear to be less than non-hospitalizations.

admitt_p = 
  ped_covid %>% 
  ggplot(aes(x = age, fill = admitted)) +
  geom_density(alpha = .6) +
  labs(
    title = "Admittance/Age Distribution",
    x = "Age (Years)",
    y = "Density") +
  theme(legend.position = "bottom")

admitt_p

Diabetes and Asthma

The distribution of diabetes, and asthma diagnoses in pediatric patients with COVID-19 infection by age are shown below. This is generally true for diabetes and asthma as well.



```r
diabetes_p = 
  ped_covid %>% 
  ggplot(aes(x = age, fill = diabetes)) +
  geom_density(alpha = .6) +
  labs(
    title = "Diabetes/Age Distribution",
    x = "Age (Years)",
    y = "Density") +
  theme(legend.position = "bottom")

diabetes_p

asthma_p = 
  ped_covid %>% 
  ggplot(aes(x = age, fill = asthma)) +
  geom_density(alpha = .6) +
  labs(
    title = "Asthma/Age Distribution",
    x = "Age (Years)",
    y = "Density") +
  theme(legend.position = "bottom")

asthma_p

Box plots

Below, we explore first systolic blood pressure, BMI, and socioeconomic status (SES) by admission status. The median first systolic pressure is higher among admitted patients compared to non-admitted patients. BMI across admission status appears to be similar, with some high BMI outliers in the non-hospitalized group. Median SES is higher among admitted patients. The SES variable was defined by the hospital using multiple economic and educational parameters, with negative values indicating below average SES and positive values indicating above average SES.

bp_p =
  ped_covid %>% 
  ggplot(aes(x = admitted, y = systolic_bp_value)) +
  geom_boxplot() +
  labs(
    title = "Systolic Blood Pressure by Admission Status",
    x = "Hospital Admission Status",
    y = "Systolic Blood Pressure")

bp_p

bmi_p =
  ped_covid %>% 
  ggplot(aes(x = admitted, y = bmi_value)) +
  geom_boxplot() +
  labs(
    title = "BMI by Admission Status",
    x = "Hospital Admission Status",
    y = "BMI Value")

bmi_p

ses_p =
  ped_covid %>% 
  ggplot(aes(x = admitted, y = ses)) +
  geom_boxplot() +
  labs(
    title = "Socioeconomic Status by Admission Status",
    x = "Hospital Admission Status",
    y = "SES Measure")

ses_p

Note that we had categories for American Indian and Multiple/Mixed Race but there were too few counts for correlation significance purposes and were omitted for simplicity.

We can see from the correlation matrix that there is some positive correlation between age and BMI value (0.42), between BMI value and Asthma (0.24) ## Correlation Matrix with relevant covariates, and also between BMI value and diabetes (0.29)

cor_data = 
  cor(model.matrix(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses, ped_covid)[,-1])

cor_data %>% 
  corrplot(method = "color", addCoef.col = "black", tl.col = "black", tl.srt = 65, insig = "blank" , number.cex = 0.7, diag = FALSE)

Possible Models

When developing some of the models, we started by considering only categorical variables for simplicity along with the continuous variable of age. The ethnicity/race variable was the only non-binary categorical variables which we tried removing as you can see in the two model tables below. However, at the 5% level of significance, age and diabetes were still the only significant variables related to hospitalization.

Categorical-only (besides age) Model

race_mod =
glm(
  admitted ~ age + gender + ethnicity_race + asthma + diabetes + obesity,
  data = ped_covid,
  family = binomial()
  ) %>% 
  broom::tidy() %>% 
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>% 
  select(term, OR, starts_with("CI"), p.value) %>% 
  knitr::kable(digits = 3)

race_mod
term OR CI_lower CI_upper p.value
(Intercept) 5.101 1.134 22.939 0.034
age 0.938 0.897 0.981 0.005
genderM 0.728 0.409 1.295 0.280
ethnicity_raceblack 0.336 0.086 1.318 0.118
ethnicity_racelatino 0.344 0.092 1.286 0.113
ethnicity_raceasian 0.323 0.044 2.371 0.266
asthma1 0.931 0.468 1.856 0.840
diabetes1 3.764 1.100 12.880 0.035
obesity1 2.166 1.070 4.384 0.032

Categorical-only Model (besides age) (Without Ethnicity/Race)

no_race_mod =
glm(
  admitted ~ age + gender + asthma + diabetes + obesity,
  data = ped_covid,
  family = binomial()
  ) %>% 
  broom::tidy() %>% 
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>% 
  select(term, OR, starts_with("CI"), p.value) %>% 
  knitr::kable(digits = 3)

no_race_mod
term OR CI_lower CI_upper p.value
(Intercept) 1.570 0.857 2.876 0.144
age 0.945 0.910 0.980 0.003
genderM 0.865 0.519 1.444 0.580
asthma1 0.864 0.456 1.639 0.655
diabetes1 5.119 1.574 16.651 0.007
obesity1 1.812 0.964 3.403 0.065

We tried to include the continuous variables of BMI value and SES. Diabetes and BMI value were significant at the 5% level of significance, but age was insignificant.

Complex model containing all relevant variables

complex_mod = 
  glm(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses, 
      data = ped_covid,
      family = binomial(link = "logit")
      ) %>%  
  broom::tidy() %>% 
   mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>%  
 select(term, OR, starts_with("CI"), p.value) %>% 
   knitr::kable(digits = 3)

complex_mod
term OR CI_lower CI_upper p.value
(Intercept) 1.140 0.149 8.723 0.900
age 0.930 0.862 1.004 0.063
genderM 0.534 0.261 1.095 0.087
ethnicity_raceblack 0.549 0.110 2.733 0.464
ethnicity_racelatino 0.441 0.092 2.113 0.306
ethnicity_raceasian 0.246 0.017 3.623 0.307
asthma1 0.959 0.439 2.095 0.916
diabetes1 5.286 1.257 22.233 0.023
bmi_value 1.054 1.009 1.100 0.017
ses 0.990 0.873 1.123 0.876

We postulated that the reason age became insignificant because we did not consider its interactions with BMI and Asthma, which we saw were related to ### Complex model but with age interactions

age_int_mod = 
  glm(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses + age*bmi_value + age*asthma, 
     data = ped_covid,
      family = binomial(link = "logit")
      ) %>%  
  broom::tidy() %>% 
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>%  
  select(term, OR, starts_with("CI"), p.value) %>% 
  knitr::kable(digits = 3)

age_int_mod
term OR CI_lower CI_upper p.value
(Intercept) 15.954 0.329 774.203 0.162
age 0.800 0.654 0.979 0.030
genderM 0.564 0.271 1.170 0.124
ethnicity_raceblack 0.510 0.102 2.562 0.414
ethnicity_racelatino 0.425 0.087 2.065 0.289
ethnicity_raceasian 0.227 0.014 3.585 0.292
asthma1 0.604 0.046 7.900 0.701
diabetes1 5.708 1.347 24.188 0.018
bmi_value 0.931 0.787 1.101 0.405
ses 0.996 0.877 1.132 0.952
age:bmi_value 1.007 0.998 1.016 0.139
age:asthma1 1.028 0.887 1.192 0.712

Complex Model with other interactions but age interactions

other_int_mod = 
  glm(admitted ~ age + gender + ethnicity_race + asthma + diabetes + bmi_value + ses + ses*bmi_value, 
     data = ped_covid,
      family = binomial(link = "logit")
      ) %>%  
  broom::tidy() %>% 
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>%  
  select(term, OR, starts_with("CI"), p.value) %>% 
  knitr::kable(digits = 3)

other_int_mod
term OR CI_lower CI_upper p.value
(Intercept) 1.154 0.065 20.525 0.922
age 0.930 0.862 1.004 0.064
genderM 0.535 0.261 1.095 0.087
ethnicity_raceblack 0.548 0.109 2.752 0.465
ethnicity_racelatino 0.440 0.092 2.116 0.306
ethnicity_raceasian 0.245 0.016 3.785 0.314
asthma1 0.959 0.438 2.097 0.916
diabetes1 5.292 1.244 22.503 0.024
bmi_value 1.053 0.966 1.149 0.242
ses 0.992 0.662 1.488 0.970
bmi_value:ses 1.000 0.986 1.014 0.990

Complex Model with all interactions

int_mod = 
  glm(admitted ~ age + gender + asthma + ethnicity_race + diabetes + bmi_value + ses + ses*bmi_value + age*asthma + age*bmi_value, 
      data = ped_covid,
      family = binomial(link = "logit")
      ) %>%  
  broom::tidy() %>% 
   mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>%  
 select(term, OR, starts_with("CI"), p.value) %>% 
   knitr::kable(digits = 3)

int_mod
term OR CI_lower CI_upper p.value
(Intercept) 15.456 0.195 1222.505 0.220
age 0.800 0.654 0.979 0.030
genderM 0.563 0.271 1.170 0.124
asthma1 0.604 0.046 7.901 0.701
ethnicity_raceblack 0.511 0.101 2.593 0.418
ethnicity_racelatino 0.425 0.087 2.072 0.290
ethnicity_raceasian 0.229 0.014 3.790 0.303
diabetes1 5.692 1.329 24.373 0.019
bmi_value 0.932 0.776 1.120 0.453
ses 0.990 0.656 1.495 0.962
bmi_value:ses 1.000 0.986 1.014 0.975
age:asthma1 1.028 0.887 1.192 0.712
age:bmi_value 1.007 0.998 1.016 0.139

Dashboard

Next, we created an interactive dashboard to further explore and visualize the data.

First we loaded additional necessary packages.

library(plotly)
library(leaflet)
library(tidycensus)
library(tmap)

To create a map, we categorized the cities into their respective counties and used the census data from the tidycensus package.

covid_df = 
  read_csv("./data/p8105_final_ped_covid.csv") %>% 
   janitor::clean_names() %>%
   mutate(county = NA,
          county = as.character(county)) %>% 
   select(city, county, everything()) %>% 
   mutate(city = tolower(city)) %>% 
   mutate(city = str_replace(city, "n white plains", "white plains")) %>% 
   mutate(county = case_when(city == "bronx" ~ "bronx",
                             city == "brooklyn" ~ "kings",
                             city == "yonkers" ~ "westchester",
                             city == "new york" ~ "new york",
                             city == "mount vernon" ~ "westchester",
                             city == "new rochelle" ~ "westchester",
                             city == "white plains" ~ "westchester",
                             city == "ridgewood" ~ "queens",
                             city == "nanuet" ~ "rockland",
                             city == "bergenfield" ~ "bergen",
                             city == "ossining" ~ "westchester",
                             city == "monroe" ~ "orange",
                             city == "newburgh" ~ "orange",
                             city == "staten island" ~ "richmond",
                             city == "port chester" ~ "westchester",
                             city == "spring valley" ~ "rockland",
                             city == "irvington" ~ "westchester",
                             city == "flushing" ~ "queens",
                             city == "chappaqua" ~ "westchester",
                             city == "new city" ~ "rockland",
                             city == "ferncliff manor" ~ "westchester",
                             city == "greenwich" ~ "washington",
                             city == "haverstraw" ~ "rockland",
                             city == "suffern" ~ "rockland",
                             city == "berkeley heights" ~ "union")
   ) %>% 
   mutate(eventdatetime = as.Date(eventdatetime, "%m/%d/%Y"),
          eventdatetime = format(eventdatetime, "%m-%Y"),
          eventdatetime = zoo::as.yearmon(eventdatetime, "%m-%Y")) %>% 
  mutate(
    ethnicity_race = case_when(
      race == "R3 Black or African-American"        ~ "black",
      race == "R2 Asian"                            ~ "asian",
      race == "R5 White"                            ~ "caucasian",
      race == "R1 American Indian or Alaska Native" ~ "american indian",
      race == "Multiple Selected"                   ~ "multiple",
      ethnicity == "E1 Spanish/Hispanic/Latino"     ~ "latino"
    ))

Average Socio-Economic Status (SES) and Average Body Mass Index (BMI)

Here, we have generated an interactive map that demonstrates average SES and average BMI across various parts of New York and New Jersey, reflecting the residences of pediatric patients with COVID-19 infection. By selecting from the layered box on the top left corner, users can customize different base layers for the map, as well as average SES or average BMI.

# Map Viz using tidycensus and tmap
county_ny = c("bronx", "kings", "westchester", "new york", "queens", "rockland", "orange", "richmond")
county_nj = c("bergen", "union")

shape_ny = 
  get_acs(geography = "tract",
          variables = "B19013_001",
          state = "NY",
          county = county_ny,
          geometry = TRUE) %>%
  janitor::clean_names() %>% 
  select(name, geometry) %>% 
  separate(name, into = c("county", "state"), sep = -17) %>% 
  mutate(state = str_sub(state, 10),
         county = tolower(county),
         county = sub(".*\\s", "", trimws(county)),
         county = str_replace(county, "york", "new york"))

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |=========                                                             |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |====================================================================  |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%
shape_nj = 
  get_acs(geography = "tract",
          variables = "B19013_001",
          state = "NJ",
          county = county_nj,
          geometry = TRUE) %>%
  janitor::clean_names() %>% 
  select(name, geometry) %>% 
  separate(name, into = c("county", "state"), sep = -18) %>% 
  mutate(state = str_sub(state, 9),
         county = tolower(county),
         county = sub(".*\\s", "", trimws(county)))

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |==========================                                            |  36%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================|  99%
  |                                                                            
  |======================================================================| 100%
shape_full =
rbind(shape_ny, shape_nj)

bmi_mean =
  covid_df %>%
  group_by(county) %>%
  summarize(bmi_mean = mean(bmi_value, na.rm = TRUE))

ses_mean = 
  covid_df %>% 
  group_by(county) %>% 
  summarize(ses_mean = mean(ses, na.rm = TRUE))

full_map_bmi = 
  left_join(shape_full, bmi_mean, by = "county")

full_map_ses = 
  left_join(shape_full, ses_mean, by = "county")

tmap_mode("view")
tm_full = 
  tm_shape(full_map_bmi) +
    tm_fill(
      n = 4,
      group = "Average BMI",
      midpoint = NA,
      col = "bmi_mean",
      palette = "viridis",
      style = "quantile",
      contrast = c(0.3, 1),
      title = "Average BMI",
      textNA = "Not Available",
      id = "state",
      popup.vars=c("County: " = "county",
                   "Average BMI: " = "bmi_mean")) +
      tm_borders(col = "white") +
  tm_shape(full_map_ses) +
    tm_fill(
      n = 4,
      group = "Average SES",
      midpoint = NA,
      col = "ses_mean",
      palette = "RdYlBu",
      style = "quantile",
      contrast = c(0.3, 1),
      title = "Average SES",
      textNA = "Not Available",
      id = "state",
      popup.vars=c("County: " = "county",
                   "Average SES: " = "ses_mean")) +
  tm_borders(col = "white") +
  tm_view(
    alpha = 0.85,
    view.legend.position = c("right", "bottom"),
    colorNA = NULL) +
  tm_scale_bar(text.size = 1) +
  tm_facets(as.layers = TRUE, sync = TRUE)

tm_full %>% 
  tmap_leaflet() %>%
  leaflet::hideGroup("Average SES")

Positivity levels over time

The number of pediatric patients testing positive for COVID-19 is shown as a function of time, by admission status (yes/no). The number of patients who returned positive SARS-CoV-2 RT-PCR tests peaked in late February and decreased until April. From April to June, levels of positive tests generally plateaued for both admitted and non-admitted patients until June, at which point levels dropped off to close to zero for both groups.

plot_1 = 
  covid_df %>% 
    group_by(eventdatetime, admitted) %>% 
    summarize(count = n()) %>% 
    ggplot(aes(x = eventdatetime, 
               y = count, 
               color = admitted)) +
    geom_point(aes(text = paste("Date: ", eventdatetime, 
                                "\nNumber of Counts: ", count, 
                                "\nAdmitted: ", admitted))) +
    geom_line() +
    labs(title = "",
           x = "Event Date",
           y = "Number of Events",
           color = "Admitted")

ggplotly(plot_1, tooltip = "text")

Ethnicity and race information of admitted and non-admitted pediatric patients

The ethnic and racial background of pediatric patients with COVID-19 infection reflect the diverse population served in the Bronx. Latinos represent the majority of patients in whom positive SARS-CoV-2 RT-PCR test results were returned, followed by blacks, caucasians, and asians. Overall, latino and black children with COVID-19 infection seem to have lower hospitalization rates compared to caucasian and asian children.

plot_2 = 
  covid_df %>% 
    group_by(ethnicity_race, admitted) %>% 
    summarize(count = n()) %>%
    filter(ethnicity_race != "american indian") %>% 
    filter(ethnicity_race != "multiple") %>% 
    ggplot(aes(x = fct_reorder(ethnicity_race, count), 
               y = count, 
               fill = admitted,
               text = paste("Ethnicity: ", ethnicity_race, 
                            "\nCounts: ", count, 
                            "\nAdmitted: ", admitted))) +
    geom_bar(stat="identity", position=position_dodge()) +
    coord_flip() +
    labs(title = "",
           x = "Counts",
           y = "Ethnicity",
           fill = "Admitted")

ggplotly(plot_2, tooltip = "text")

Admissions for COVID-19 and age

The below interactive bar graph is another representation of our earlier exploratory ggplot showing the density or count of COVID-19 positivity as function of age. This information is further stratified by admission status (yes/no).

plot_3 = 
  covid_df %>% 
    mutate(age = round(age),
           age = as.factor(age)) %>% 
    group_by(age, admitted) %>% 
    summarize(count = n()) %>%
    ggplot(aes(x = age, 
               y = count, 
               fill = admitted,
               text = paste("Age: ", age, 
                            "\nCounts: ", count, 
                            "\nAdmitted: ", admitted))) +
    geom_bar(stat="identity", position=position_dodge()) +
    labs(title = "",
           x = "Age",
           y = "Counts",
           fill = "Admitted")

ggplotly(plot_3, tooltip = "text")

Body mass index (BMI), age, and admission status

The scatterplot shows BMI values versus age by admission status with a smooth curve. In general, BMI values increase as age increases for both admitted and non-admitted patients. The smooth curves generally overlap but diverge beginning at around age 14, with a trend for more admissions for extremely obese patients with high outlier BMI values.

plot_4 =
  covid_df %>% 
    ggplot(aes(x = age, y = bmi_value, color = admitted)) +
    geom_point(aes(text = paste("Age: ", age, 
                            "\nBMI: ", bmi_value, 
                            "\nAdmitted: ", admitted))) +
    geom_smooth(se = FALSE) +
    labs(title = "",
           x = "Age",
           y = "BMI",
           color = "Admitted")

ggplotly(plot_4, tooltip = "text")

Body mass index (BMI), age, and ethnicity/race

The below interactive plot shows plots of BMI as a function of age, stratified by race. In general, BMI values seem to increase as age increases. There are outliers with some extreme BMI values, including some with a BMI of greater than 60.

plot_5 =
  covid_df %>% 
    filter(ethnicity_race != "multiple") %>% 
    ggplot(aes(x = age, y = bmi_value, color = ethnicity_race)) +
    geom_point(aes(text = paste("Age: ", age, 
                            "\nBMI: ", bmi_value, 
                            "\nEthnicity: ", ethnicity_race))) +
    geom_smooth(se = FALSE) +
    labs(title = "",
           x = "Age",
           y = "BMI",
           color = "Ethnicity")

ggplotly(plot_5, tooltip = "text")

Body mass index (BMI) by Gender

The below interactive box plot explores the relationship between BMI and gender. The median BMI is higher in female children with COVID-19 infection compared to male children. There are some high BMI outliers, particularly among male children, with one BMI of 80.84.

plot_6 =
  covid_df %>% 
  filter(gender != "U") %>% 
  ggplot(aes(x = gender, y = bmi_value, fill = gender)) +
   geom_boxplot(draw_quantiles = c(0.25, 0.5, 0.75)) +
   stat_summary(fun = "mean", color = "blue") +
   labs(
     title = "",
     x = "Gender",
     y = "BMI",
     caption = ""
     ) +
   theme(legend.position="none")

ggplotly(plot_6)

Body mass index (BMI) and Socio-Economic Status (SES) by Gender

This interactive scatterplot explores the relationship between SES and BMI in this dataset. In general, SES and BMI values are clustered at lower values of BMI with a spread of SES that seems similar for males and females between 0 and -10. Briefly, the SES variable is a score generated from census-block groups with increasing score signifying an increasing neighborhood socioeconomic advantage. BMI values appears to be unevenly distributed with clustering of values below 30 and clustering of values greater than 30. Again, we observe the high outlier values of BMI for males and these appear to be distributed in the lower half of SES values.

plot_7 =
  covid_df %>% 
    filter(gender != "U") %>%
    ggplot(aes(x = bmi_value, y = ses, color = gender)) +
    geom_point(aes(text = paste("SES: ", ses, 
                            "\nBMI: ", bmi_value, 
                            "\nGender: ", gender))) +
    labs(title = "",
           x = "BMI",
           y = "SES",
           color = "Gender")

ggplotly(plot_7, tooltip = "text")

Age, Body mass index (BMI) and Socio-Economic Status (SES) by Gender

The below scatterplot further explores BMI and age in a faceted plot by gender. Hovering over each observation will reveal the specific age, BMI, gender, as well as SES. Futhermore, the area of each circle corresponds to the SES of that particular patient.

plot_8 =
  covid_df %>% 
    filter(gender != "U") %>%
    ggplot(aes(x = age, 
               y = bmi_value, 
               size = ses, 
               color = gender,
               text = paste("Age: ", age,
                            "\nSES: ", ses, 
                            "\nBMI: ", bmi_value, 
                            "\nGender: ", gender))) +
      geom_point(alpha=0.4) +
      scale_size(range = c(1, 8)) +
      theme(legend.position="none") +
      facet_grid(.~gender) +
      labs(title = "",
           x = "Age",
           y = "BMI",
           color = "Gender")

ggplotly(plot_8, tooltip = "text")

Average age by geographic location

The below interactive bar graph explores the relationship between average age of COVID-19 infected children and geographic location. The average age of pediatric patients was 14.5 years in the Bronx, which returned the greatest positivity count. Staten Island had the oldest average age (22 years) and Berkeley Heights had the youngest average age (5 years), but there was only one patient from each of these respective locations.

plot_9 =
  covid_df %>% 
  group_by(city, county) %>% 
  summarize(mean_age = mean(age, na.rm = TRUE),
            count = n()) %>%
  drop_na(city) %>% 
  ggplot(aes(x = fct_reorder(city, mean_age),
             y = mean_age,
             fill = count,
             text = paste("City: ", city, 
                          "\nAvg Age: ", mean_age, 
                          "\nCounts: ", count))) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "",
         x = "City",
         y = "Average Age",
         fill = "Count")
  
ggplotly(plot_9, tooltip = "text")

Body mass index (BMI) and Socio-Economic Status (SES)

This heat map explores the density of BMI and SES, suggesting two regions of greatest density for these variables. One “hot” patch seems to be defined by SES values ranging from -6.5 to -7.5 and BMI values from 20.5 to 24. The second hot patch consists of higher SES values, with a range of -1.3 to -2.9 and BMI values ranging from 22 to 25.

plot_10 = 
  covid_df %>% 
  ggplot(aes(x = ses, y = bmi_value)) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  ylim(10, 35) +
  theme(legend.position = "right") +
  labs(title = "",
       x = "SES",
       y = "BMI",
       fill = "Density")

ggplotly(plot_10)

Model Building

Our next goal was to build an interactive predictive tool allowing users to see the effects of changing certain predictors on risk of hospitalization. To do this we examined several predictive models.

Based on our exploratory analysis as well as a priori knowledge, we incorporated the following variables into our predictive model for hospitalization:

  • Age
  • BMI value
  • Systolic blood pressure value
  • Race
  • History of asthma diagnosis
  • History of diabetes diagnosis

Our goal in the following analyses was to examine the area under the operating characteristic curve (AUC) for various models. We used the caret package in R, which allows easy implentation and tuning of various different machine learning algorithms.

First we loaded in the data and divided the data into training and testing sets using a 2:1 ratio.

# Load in the data and divide into training and testing sets
library(caret)
library(tidyverse)
library(ROCR)

datacomplete = read_csv("./data/datacomplete.csv") %>%
  mutate_at(c("admitted", "ethnicity_race", "asthma", "diabetes"), as.factor) %>%
  select(admitted, age, bmi_value, systolic_bp_value, ethnicity_race, asthma, diabetes)

train_data = 
  datacomplete %>%
  slice(1:250) 

test_data = 
  datacomplete %>%
  slice(251:375)

Random Forest Model

First we used a random forest model, which is an ensemble learning method for classification or regression which constructs a multitude of decision trees using a training set and outputs the class that is the mode of the classes of the individual trees.

myFolds = createFolds(train_data$admitted, k = 5)
rfControl = trainControl(
    summaryFunction = twoClassSummary,
    classProbs = TRUE, 
    verboseIter = TRUE,
    savePredictions = TRUE,
    index = myFolds
)
set.seed(123)
rfmodel = caret::train(
    admitted ~., 
    data = train_data,
    metric = "ROC",
    method = "ranger",
    trControl = rfControl
 )
+ Fold1: mtry= 2, min.node.size=1, splitrule=gini 
- Fold1: mtry= 2, min.node.size=1, splitrule=gini 
+ Fold1: mtry= 6, min.node.size=1, splitrule=gini 
- Fold1: mtry= 6, min.node.size=1, splitrule=gini 
+ Fold1: mtry=10, min.node.size=1, splitrule=gini 
- Fold1: mtry=10, min.node.size=1, splitrule=gini 
+ Fold1: mtry= 2, min.node.size=1, splitrule=extratrees 
- Fold1: mtry= 2, min.node.size=1, splitrule=extratrees 
+ Fold1: mtry= 6, min.node.size=1, splitrule=extratrees 
- Fold1: mtry= 6, min.node.size=1, splitrule=extratrees 
+ Fold1: mtry=10, min.node.size=1, splitrule=extratrees 
- Fold1: mtry=10, min.node.size=1, splitrule=extratrees 
+ Fold2: mtry= 2, min.node.size=1, splitrule=gini 
- Fold2: mtry= 2, min.node.size=1, splitrule=gini 
+ Fold2: mtry= 6, min.node.size=1, splitrule=gini 
- Fold2: mtry= 6, min.node.size=1, splitrule=gini 
+ Fold2: mtry=10, min.node.size=1, splitrule=gini 
- Fold2: mtry=10, min.node.size=1, splitrule=gini 
+ Fold2: mtry= 2, min.node.size=1, splitrule=extratrees 
- Fold2: mtry= 2, min.node.size=1, splitrule=extratrees 
+ Fold2: mtry= 6, min.node.size=1, splitrule=extratrees 
- Fold2: mtry= 6, min.node.size=1, splitrule=extratrees 
+ Fold2: mtry=10, min.node.size=1, splitrule=extratrees 
- Fold2: mtry=10, min.node.size=1, splitrule=extratrees 
+ Fold3: mtry= 2, min.node.size=1, splitrule=gini 
- Fold3: mtry= 2, min.node.size=1, splitrule=gini 
+ Fold3: mtry= 6, min.node.size=1, splitrule=gini 
- Fold3: mtry= 6, min.node.size=1, splitrule=gini 
+ Fold3: mtry=10, min.node.size=1, splitrule=gini 
- Fold3: mtry=10, min.node.size=1, splitrule=gini 
+ Fold3: mtry= 2, min.node.size=1, splitrule=extratrees 
- Fold3: mtry= 2, min.node.size=1, splitrule=extratrees 
+ Fold3: mtry= 6, min.node.size=1, splitrule=extratrees 
- Fold3: mtry= 6, min.node.size=1, splitrule=extratrees 
+ Fold3: mtry=10, min.node.size=1, splitrule=extratrees 
- Fold3: mtry=10, min.node.size=1, splitrule=extratrees 
+ Fold4: mtry= 2, min.node.size=1, splitrule=gini 
- Fold4: mtry= 2, min.node.size=1, splitrule=gini 
+ Fold4: mtry= 6, min.node.size=1, splitrule=gini 
- Fold4: mtry= 6, min.node.size=1, splitrule=gini 
+ Fold4: mtry=10, min.node.size=1, splitrule=gini 
- Fold4: mtry=10, min.node.size=1, splitrule=gini 
+ Fold4: mtry= 2, min.node.size=1, splitrule=extratrees 
- Fold4: mtry= 2, min.node.size=1, splitrule=extratrees 
+ Fold4: mtry= 6, min.node.size=1, splitrule=extratrees 
- Fold4: mtry= 6, min.node.size=1, splitrule=extratrees 
+ Fold4: mtry=10, min.node.size=1, splitrule=extratrees 
- Fold4: mtry=10, min.node.size=1, splitrule=extratrees 
+ Fold5: mtry= 2, min.node.size=1, splitrule=gini 
- Fold5: mtry= 2, min.node.size=1, splitrule=gini 
+ Fold5: mtry= 6, min.node.size=1, splitrule=gini 
- Fold5: mtry= 6, min.node.size=1, splitrule=gini 
+ Fold5: mtry=10, min.node.size=1, splitrule=gini 
- Fold5: mtry=10, min.node.size=1, splitrule=gini 
+ Fold5: mtry= 2, min.node.size=1, splitrule=extratrees 
- Fold5: mtry= 2, min.node.size=1, splitrule=extratrees 
+ Fold5: mtry= 6, min.node.size=1, splitrule=extratrees 
- Fold5: mtry= 6, min.node.size=1, splitrule=extratrees 
+ Fold5: mtry=10, min.node.size=1, splitrule=extratrees 
- Fold5: mtry=10, min.node.size=1, splitrule=extratrees 
Aggregating results
Selecting tuning parameters
Fitting mtry = 2, splitrule = gini, min.node.size = 1 on full training set
pred_rf = predict(rfmodel,test_data,type="prob")
pred_rf = pred_rf$yes
auc_rf = as.numeric(ROCR::performance(prediction(pred_rf,test_data$admitted),"auc")@y.values)
auc_rf
[1] 0.5561543

Our AUC with the random forest model is 0.5561543.

XGBoost Model

The XGBoost model is a powerful ensemble machine learning algorithm that implements gradient boosted decision trees.

xgbTrainingControl = trainControl(method = "CV", 
                                     number = 5, 
                                     savePredictions = TRUE, 
                                     classProbs = TRUE, 
                                     summaryFunction = twoClassSummary,
                                     verboseIter = F)
nrounds = 1000
  
xgb_grid = expand.grid(
    nrounds = seq(from = 200, to = nrounds, by = 50),
    eta = c(0.025, 0.05, 0.1, 0.3),
    max_depth = c(2, 3, 4, 5, 6),
    gamma = 0,
    colsample_bytree = 1,
    min_child_weight = 1,
    subsample = 1
)

set.seed(123)
fit_xgb = caret::train(admitted~., 
                         data = train_data, 
                         metric = "ROC",
                         method = "xgbTree", 
                         tuneGrid = xgb_grid,
                         trControl = xgbTrainingControl)
  
pred_xgb = predict(fit_xgb,test_data,type="prob")
pred_xgb = pred_xgb$yes
auc_xgb = as.numeric(ROCR::performance(ROCR::prediction(pred_xgb,test_data$admitted),"auc")@y.values)
auc_xgb
[1] 0.6304594

The AUC generated by the XGB model is 0.6304594.

Glmnet Model

Next we explored the use of a generalized linear model through Glmnet, which fits a generalized linear model via penalized maximum likelihood.

glmnetControl = trainControl(
    method = "cv", 
    number = 10,
    summaryFunction = twoClassSummary,
    classProbs = TRUE, 
    verboseIter = TRUE
)
set.seed(123)
glmnetmodel = caret::train(
    admitted ~. , 
    data=train_data,
    tuneGrid = expand.grid(
      alpha=0:1,
      lambda=seq(0.0001,1, length=20)
    ),
    method = "glmnet",
    trControl = glmnetControl,
    preProcess= c("center","scale")
  )
+ Fold01: alpha=0, lambda=1 
- Fold01: alpha=0, lambda=1 
+ Fold01: alpha=1, lambda=1 
- Fold01: alpha=1, lambda=1 
+ Fold02: alpha=0, lambda=1 
- Fold02: alpha=0, lambda=1 
+ Fold02: alpha=1, lambda=1 
- Fold02: alpha=1, lambda=1 
+ Fold03: alpha=0, lambda=1 
- Fold03: alpha=0, lambda=1 
+ Fold03: alpha=1, lambda=1 
- Fold03: alpha=1, lambda=1 
+ Fold04: alpha=0, lambda=1 
- Fold04: alpha=0, lambda=1 
+ Fold04: alpha=1, lambda=1 
- Fold04: alpha=1, lambda=1 
+ Fold05: alpha=0, lambda=1 
- Fold05: alpha=0, lambda=1 
+ Fold05: alpha=1, lambda=1 
- Fold05: alpha=1, lambda=1 
+ Fold06: alpha=0, lambda=1 
- Fold06: alpha=0, lambda=1 
+ Fold06: alpha=1, lambda=1 
- Fold06: alpha=1, lambda=1 
+ Fold07: alpha=0, lambda=1 
- Fold07: alpha=0, lambda=1 
+ Fold07: alpha=1, lambda=1 
- Fold07: alpha=1, lambda=1 
+ Fold08: alpha=0, lambda=1 
- Fold08: alpha=0, lambda=1 
+ Fold08: alpha=1, lambda=1 
- Fold08: alpha=1, lambda=1 
+ Fold09: alpha=0, lambda=1 
- Fold09: alpha=0, lambda=1 
+ Fold09: alpha=1, lambda=1 
- Fold09: alpha=1, lambda=1 
+ Fold10: alpha=0, lambda=1 
- Fold10: alpha=0, lambda=1 
+ Fold10: alpha=1, lambda=1 
- Fold10: alpha=1, lambda=1 
Aggregating results
Selecting tuning parameters
Fitting alpha = 1, lambda = 0.0527 on full training set
pred_glmnet = predict(glmnetmodel,test_data,type="prob") 
pred_glmnet = pred_glmnet$yes
auc_glmnet = as.numeric(ROCR::performance(prediction(pred_glmnet,test_data$admitted),"auc")@y.values)
auc_glmnet
[1] 0.6202496

The AUC of the glmnet model is 0.6202496.

SPLS Model

Finally, we tried a partial least squares model, which is a linear regression method that can deal with large numbers of predictors, small sample size, and high collinearity among predictors. That is less the case in our trimmed down set of predictors, but nonetheless we decided to examine how it would perform in our dataset.

#SPLS
train_grid = expand.grid(K =c(3,4,5,6,7,8,9),eta =c(0.35,0.4,0.45,0.5,0.55,0.6,0.7,0.8,0.9),
                          kappa=0.5)

set.seed(123)
plsFit = caret::train(admitted~.,
                         data = train_data,
                         method = "spls",
                         tuneGrid = train_grid,
                         trControl = xgbTrainingControl,
                         metric = "ROC")

pred_plsda = predict(plsFit,test_data,type="prob")
pred_plsda = pred_plsda$yes
auc_plsda = as.numeric(ROCR::performance(prediction(pred_plsda,test_data$admitted),"auc")@y.values)
auc_plsda

The AUC for the SPLS model is 0.6101815.

Overall, all the AUCs are much lower than what would be considered useful in clinical practice. However, given that the goal of our analyses is mainly exploratory at this point, we then built an interactive predictive tool allowing users to see the effects of changing certain predictors on risk of hospitalization. Results should clearly be interpreted with caution and the models will benefit from the addition of other predictors which can be added on as this dataset grows. We ultimately chose to use the random forest model, given similar AUCs between each of the models and the relatively quick speed at which the random forest model generates its predictions.

Code for the shiny app can be found at Lusha’s github.

Discussion

We performed an exploratory analysis of a dataset of 375 pediatric patients who tested positive for COVID-19 and assessed predictors for hospitalization. We found age, obesity, and diabetes were associated with the need for hospitalization in multivariable logistic regression modeling. Gender, ethnicity, and asthma were not found to be significantly associated with the need for hospitalization. There are a number of important limitations in this dataset that may affect the validity of the study results. First, the availability of testing was limited in the early phases of the COVID-19 pandemic. This decreased its diagnostic utility, with SARS-CoV-2 RT PCR testing done in a confirmatory manner sometimes after the decision to admit was already made. This differential testing may affect the validity of study results. The sample size of 375 from a single tertiary care medical center is also limited, reflecting the relative overall decreased severity of illness of COVID-19 in pediatric patients compared to adults. Asthma was coded in a simplistic manner that does not reflect the complexity of this comorbidity. For example, asthma can vary in severity from, for example, mild intermittent to status asthmaticus. Our coding of asthma does not capture the gradient of disease severity. Obesity was also coded as a simple binary variable with a BMI cut-point of 30. In reality, obesity in children is defined by the 95th percentile of BMI, which varies by age.

Finally, the clinical decision to admit a pediatric patient with confirmed or suspected COVID-19 infection is a complex one. This dataset contained limited information about vital signs such as respiratory rate or oxygen saturation, which reflect acute respiratory distress. While past medical history is important, acute changes in a child’s pulmonary status are more clinically important in the decision to admit or not admit. These data were not available in this dataset. Notwithstanding these limitations, the exploratory data analyses in this project set the foundation for future research, which should take into account additional clinical data.

References

  1. McCreary EK, Pogue JM. Coronavirus disease 2019 treatment: a review of early and emerging options. Open Forum Infect Dis. 2020;7(4):ofaa105
  2. Dong Y, Mo X, Hu Y, et al. Epidemiology of COVID-19 among children in China. Pediatrics. 2020;145(6):20200702e
  3. American Academy of Pediatrics. Children and COVID-19: State Data Report. https://services.aap.org/en/pages/2019-novel-coronavirus-covid-19-infections/children-and-covid-19-state-level-data-report/. Epub 26 November 2020.
  4. Kim L, Whitaker M, O’Halloran A, et al. Hospitalization Rates and Characteristics of Children Aged <18 Years Hospitalized with Laboratory-Confirmed COVID-19 — COVID-NET, 14 States, March 1–July 25, 2020. MMWR Morb Mortal Wkly Rep 2020;69:1081–1088. DOI: http://dx.doi.org/10.15585/mmwr.mm6932e3
  5. Chao JY, Derespina K, Herold BC, et al. Clinical Characteristics and Outcomes of Hospitalized and Critically Ill Children and Adolescents with Coronavirus Disease 2019 at a Tertiary Care Medical Center in New York City. J Pediatr 2020 Aug;223:14-19.e2. doi: 10.1016/j.jpeds.2020.05.006. Epub 2020 May 11.
  6. Feldstein L, Rose RB, Horwitz SM, et al. Multisystem Inflammatory Syndrome in U.S. Children and Adolescents N Eng J Med 2020 Jul 23;383(4):334-346. doi: 10.1056/NEJMoa2021680. Epub 2020 Jun 29.